In recent years, there has been increased interest in the development and use of quantitative structure-
activity/property relationship (QSAR/QSPR) models. For the most part, this is because experimental data
are sparse and costly to obtain, while theoretical structural descriptors can be obtained quickly and
inexpensively. In this study, three linear regression methods, viz. principal component regression (PCR), partial least
squares (PLS), and ridge regression (RR), were used to develop QSPR models for the estimation of human blood:air
partition coefficient (logPblood:air) for a group of 31 diverse low-molecular-weight volatile chemicals from their
computed molecular descriptors. In general, RR was found to be superior to PCR or PLS. Comparisons were made
between models developed using parameters based solely on molecular structure and linear regression (LR) models
developed using experimental properties, including saline:air partition coefficient (logPsaline:air) and olive oil:air
partition coefficient (logPolive oil:air), as independent variables, indicating that the structure-property correlations are
comparable to the property-property correlations. The best models, however, were those which used rat logPblood:air
as the independent variable. Haloalkane subgroups were modeled separately for comparative purposes, and although
models based on the congeneric compounds were superior, the models developed on the complete set of diverse
compounds were of acceptable quality. The structural descriptors were placed into one of three classes based on
level of complexity: topostructural (TS), topochemical (TC), or 3-dimensional/geometrical (3D). Modeling was
performed using the structural descriptor classes both in a hierarchical fashion and separately. The results indicate
that the highest quality structure-based models, in terms of descriptor classes, were those derived using TC or
TS+TC descriptors.

Key Words: Blood:air partition coefficient; PBPK model; theoretical molecular descriptors; ridge regression;
quantitative structure-property relationship (QSPR) model.

1. INTRODUCTION

Modern lifestyle worldwide is based on the use of a large number of chemicals. Natural and synthetic
chemicals are used as drugs, pesticides, herbicides, components of diagnostic tools, and ingredients and
solvents in industrial processes, to name just a few. The Toxic Substances Control Act (TSCA) Inventory
maintained by the United States Environmental Protection Agency (USEPA) currently has over 81,000
entries and the list is growing every year.(1) Many of these chemicals are used for various purposes and
have the potential to be released into the environment. Therefore, it is natural that we need to carry out risk
assessment of the TSCA chemicals, particularly for those that are used frequently and in large quantities.
Volatile organic chemicals (VOCs) constitute a class of chemicals that are frequently used in various
industrial processes. Therefore, there is an interest in predicting the potential adverse effects of these
chemicals on human and environmental health. The overall risk of a chemical is determined primarily by
its intrinsic toxicity (hazard) and exposure potential.

The blood:air partition coefficient of VOCs is an important determinant of pulmonary uptake of such
chemicals from inhaled air. Such parameters are routinely used in building physiologically-based
pharmacokinetic (PBPK) models for exposure assessment of such chemicals. Solubility of VOCs in blood
is determined by the composition of blood, including the content of neutral lipid, phospholipid, and water, as well as
the extent of binding of these chemicals to specific components such as plasma proteins and
hemoglobin.(2) Such physicochemical considerations can be used to derive physicochemically based
methods for the estimation of partition coefficient values of chemicals. The other possibility is the
use of molecular descriptors to estimate partition coefficients of chemicals directly from their structure.
Such quantitative structure-activity/property relationship (QSAR/QSPR) methods derived using
theoretical descriptors are based on the idea that observable physicochemical and biological properties of
chemicals are determined by their molecular structure. In particular, QSPRs have been found to be useful
in the estimation of physicochemical properties such as octanol:water partition coefficient of various
groups of chemicals,(3,4) as well as the degree of transport through the blood-brain barrier(5) and skin,(6) of
various congeneric and diverse sets.

While some quantitative models use experimental data per se as independent variables, it is important
to note that experimental data do not exist for the majority of compounds, and obtaining such data is
costly in terms of time and monetary resources. Computational modeling involving algorithmically
calculated parameters based solely on molecular structure is an inexpensive alternative. In this paper, we
have attempted to develop QSPR models to estimate human blood:air partition coefficients for a set of 31
VOCs using molecular descriptors which can be computed directly from molecular structure.

2. METHODS

2.1 Database. Liquid:air partition coefficients were experimentally determined by Gargas et al.(7) using a
modified version of the gas-phase vial equilibrium technique(8) for a set of low-molecular-weight volatile
chemicals. Table I includes experimentally determined human and male Fischer 344 rat blood:air partition
coefficient data for a set of 31 chemicals including 18 haloalkanes, 2 nitroalkanes, 2 aliphatic
hydrocarbons, 4 haloalkenes, and 5 aromatic compounds. The human blood:air partition coefficient
values were determined on blood pretreated with diethyl maleate to inhibit an observed glutathione
transferase reaction. Experimental saline:air and olive oil:air partition coefficients, determined by Gargas
et al., are also listed in Table I. All experimental values were obtained at 37 °C.

It should be noted that the data used in the current study are a subset of those reported by Gargas et al.(7)
Two  cis/trans  isomers  were  eliminated  because  they  are  indistinguishable  in  terms  of  their  calculated 
molecular  descriptors  based  on  SMILES  input.  Methyl  chloride  was  also  removed  from  the  data  set  as  it 
is  not  possible  to  calculate  our  entire  set  of  theoretic  descriptors  on  two-atom  compounds.  In  addition, 
two  compounds  were  reported  without  discrete  values  for  0.9%  saline:air  partition  coefficient  and  thus 
were  not  included  in  this  study. 

2.2 Theoretical Molecular Descriptors. Theoretical molecular descriptors may be divided into
hierarchical classes based upon level of complexity. Topostructural (TS) descriptors, which encode
information strictly on the adjacency and connectedness of atoms within a molecule, make up the simplest
of the hierarchical classes. Topochemical (TC) descriptors encode information related to the chemical
nature of a molecule, including bond type. The 3-dimensional or shape descriptors (3D) are still more
complex, encoding information about the 3-dimensional aspects of a molecule. Calculated logPn-octanol:water
descriptors(9) were included at the final stage of hierarchical model development. The topostructural and
topochemical descriptors are collectively referred to as topological descriptors.

Descriptors used in the present study were derived from molecular structure using software packages
including POLLY,(10) Triplet,(11,12) and Molconn-Z.(13) From POLLY, a set of topological descriptors is
available, including a large group of connectivity indices,(14-17) path-length descriptors,(14) and information
theoretic(18,19) and neighborhood complexity indices.(19) The Triplet descriptors also constitute a large
group of topological parameters. They are derived from a matrix, a main diagonal column vector, and a
free term column vector, converting the matrix into a system of linear equations whose solutions are the
local vertex invariants. These local vertex invariants, $x_i$, are then used in the following mathematical
operations in order to obtain the triplet descriptors:

1. Summation, $\sum_i x_i$

2. Summation of squares, $\sum_i x_i^2$

3. Summation of square roots, $\sum_i x_i^{1/2}$

4. Sum of inverse square root of cross-product over edges ij, $\sum_{ij} (x_i x_j)^{-1/2}$

5. Product, $N \left( \prod_i x_i \right)^{1/N}$
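The triplet construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Triplet program itself: the function name `triplet_descriptors`, the propane example, and the choice of an adjacency matrix with graph order on the diagonal and vertex degree as the free term (the combination Table II labels ANVy) are all chosen here for demonstration. Operation 5 is read as a product over the local vertex invariants, as its label indicates.

```python
import numpy as np

def triplet_descriptors(matrix, diagonal, free_term, edges):
    """Illustrative sketch: solve the triplet linear system, then apply
    the five operations to the local vertex invariants x_i."""
    m = np.array(matrix, dtype=float)
    np.fill_diagonal(m, diagonal)  # main-diagonal column vector replaces the diagonal
    x = np.linalg.solve(m, np.asarray(free_term, dtype=float))  # local vertex invariants
    n = len(x)
    # Operations 3 and 4 assume all x_i are positive, as in this example.
    return {
        1: x.sum(),                                        # summation
        2: (x ** 2).sum(),                                 # summation of squares
        3: np.sqrt(x).sum(),                               # summation of square roots
        4: sum((x[i] * x[j]) ** -0.5 for i, j in edges),   # over edges ij
        5: n * np.prod(x) ** (1.0 / n),                    # N * (product)^(1/N)
    }

# Hypothetical example: hydrogen-suppressed propane, a path of three vertices.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
edges = [(0, 1), (1, 2)]
degrees = [1, 2, 1]
anv = triplet_descriptors(adjacency, [3, 3, 3], degrees, edges)
```

For this small graph the system can be solved by hand, giving local vertex invariants 1/7, 4/7, and 1/7, which makes the five values easy to verify.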

Molconn-Z provides additional topological descriptors, including an extended set of connectivity indices,
electrotopological indices,(20,21) and hydrogen bonding descriptors, as well as a small set of molecular
shape descriptors.

H-Bond, a software program developed by Basak,(22) was used to calculate HB1, a measure of
hydrogen bonding potential. Balaban's J indices were also calculated by software developed by the
authors.(23-25)

LogPn-octanol:water values were calculated by the LogP program(9) and are included in Table I. Table II
provides a brief description of all other theoretical molecular descriptors used in the current study, though
the calculated values for these descriptors are not included for the sake of brevity.


2.3  Statistical  Analysis.  Independent  and  dependent  variables  were  scaled  by  the  natural  logarithm,  as 
their  respective  ranges  differed  by  several  orders  of  magnitude.  The  CORR  procedure  of  the  SAS 
statistical  package(26)  was  used  to  identify  perfectly  correlated  descriptors,  i.e.  r  =  1.0.  In  each  case,  only 
one  descriptor  of  a  perfectly  correlated  pair  was  retained  for  use  in  the  subsequent  analysis.  Any 
descriptor  that  either  had  a  value  of  zero  for  all  compounds  in  the  data  set  or  could  not  be  calculated  for 
all  compounds  in  the  data  set  was  removed. 
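The screening steps described above can be sketched as follows. This is a minimal NumPy illustration rather than the SAS CORR workflow used in the study; the function name `screen_descriptors`, the tolerance `tol`, and the toy data are assumptions introduced here. Missing values are represented as NaN, and "perfectly correlated" is taken literally as r = 1.0 within floating-point tolerance.

```python
import numpy as np

def screen_descriptors(X, names, tol=1e-12):
    """Drop descriptors that are all zero, that could not be calculated
    for every compound (NaN), or that duplicate a retained descriptor
    with correlation r = 1.0. Returns the reduced matrix and names."""
    X = np.asarray(X, dtype=float)
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if np.isnan(col).any():   # not calculable for all compounds
            continue
        if np.all(col == 0.0):    # zero for every compound
            continue
        duplicate = False
        for k in keep:
            with np.errstate(invalid="ignore", divide="ignore"):
                r = np.corrcoef(col, X[:, k])[0, 1]
            if np.isfinite(r) and r >= 1.0 - tol:
                duplicate = True  # keep only one of a perfectly correlated pair
                break
        if not duplicate:
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]

# Hypothetical toy descriptor matrix: 4 compounds, 5 descriptors.
a = np.array([1.0, 2.0, 3.0, 4.0])
X_toy = np.column_stack([a, 2 * a + 1, np.zeros(4),
                         [1.0, np.nan, 2.0, 3.0], [4.0, 1.0, 3.0, 2.0]])
X_kept, kept_names = screen_descriptors(X_toy, ["a", "b", "zero", "missing", "c"])
```

Here descriptor `b` is dropped because it is an exact linear function of `a`, while `zero` and `missing` are dropped for the other two reasons given in the text.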

The structure-property models were developed using ridge regression (RR),(27) principal components
regression (PCR),(28) and partial least squares (PLS) regression(29-31) methodologies, utilizing molecular
descriptors in a hierarchical fashion. In addition, each class of descriptors was used independently to
obtain single-class models. RR, PCR, and PLS are useful in cases wherein the number of descriptors is
much greater than the number of observations, as well as in cases where the independent variables are
highly intercorrelated. In addition, these regression methods make use of all independent variables, as
opposed to subset regression, wherein it is possible that important parameters may be eliminated from the
study. Linear regression (LR) was used to obtain the property-property models, which involve 1-2
independent variables. Statistical parameters reported include the cross-validated $R^2$ value and the PRESS
statistic, which are reliable measures of model predictability. In addition, the t values can be examined in
order to identify significant descriptors. Although a large |t| indicates that the
associated descriptor is important in the model, it should be cautioned that the reverse is not necessarily
true.
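To make one of these methods concrete, the following is a minimal sketch of principal components regression using only NumPy. It is not the implementation used in the study: the function name `pcr_fit`, the mean-centering convention, and the synthetic data are assumptions made for illustration. The response is regressed on the leading principal component scores of the centered descriptor matrix, and the result is mapped back to coefficients on the original descriptors.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Principal components regression sketch.
    n_components should not exceed the rank of the centered X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Rows of Vt are the principal directions; s holds the singular values.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T
    scores = Xc @ V                              # component scores
    # Scores are orthogonal, so each score coefficient is a simple ratio.
    gamma = (scores.T @ yc) / (s[:n_components] ** 2)
    beta = V @ gamma                             # back to descriptor space
    intercept = y_mean - x_mean @ beta
    return beta, intercept

# Hypothetical synthetic data for demonstration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
beta_full, b0_full = pcr_fit(X_demo, y_demo, n_components=3)
beta_one, b0_one = pcr_fit(X_demo, y_demo, n_components=1)
```

When every component is retained and the descriptor matrix has full column rank, the sketch reproduces ordinary least squares; retaining fewer components is what stabilizes the fit when descriptors are highly intercorrelated.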

Honest assessment of the quality of a prediction model is seldom straightforward, but is particularly
challenging in a situation such as this where the number of independent variables far exceeds the number
of observations.(32,33) In these cases, conventional regression measures such as $R^2$ are useless. The
measure we use is the cross-validation (or jack-knife) sum of squares. For this measure, each compound
in turn is omitted from the data set, and the coefficients of the regression model (RR, PLS or PCR) are
computed using the remaining n-1 cases. These coefficients are used to predict the hold-out case. The
overall quality of the fit is measured by the prediction sum of squares, PRESS: the sum of squares of the
differences between the actual observed activity and that predicted from the regression. A cross-validation
$R^2$ can be defined by

$$R^2_{cv} = 1 - \frac{\mathrm{PRESS}}{SS_{Total}}$$

Unlike $R^2$, this $R^2_{cv}$ does not increase if irrelevant predictors are added to the model; rather it tends to
decrease. And where $R^2$ is necessarily non-negative, $R^2_{cv}$ may be negative. This not uncommon
situation is an indication that the model fitted is poor: worse, in fact, than making predictions by
ignoring the predictors and using the mean activity as the prediction in all circumstances.
$R^2_{cv}$ mimics the results of applying the final regression to predicting a future case; large values can be
interpreted unequivocally and without regard to either the number of cases or predictors as indicating that
the fitted regression will accurately predict the activity of future compounds of the same chemical type as
those used to calibrate the regression.
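The leave-one-out procedure described above can be sketched as follows for ridge regression. This is a hedged illustration, not the authors' code: the function names, the ridge constant `ridge_k` and how it would be chosen, the mean-centering convention, and the synthetic data are all assumptions introduced here.

```python
import numpy as np

def ridge_fit(X, y, ridge_k):
    """Ridge regression on mean-centered data; returns (coefficients, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + ridge_k * np.eye(p), Xc.T @ yc)
    return beta, y_mean - x_mean @ beta

def loo_press(X, y, ridge_k):
    """Leave-one-out PRESS and cross-validated R^2 = 1 - PRESS / SS_Total."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i                  # omit compound i
        beta, b0 = ridge_fit(X[mask], y[mask], ridge_k)
        press += (y[i] - (X[i] @ beta + b0)) ** 2  # squared hold-out error
    ss_total = ((y - y.mean()) ** 2).sum()
    return press, 1.0 - press / ss_total

# Hypothetical synthetic data: 31 "compounds", 5 "descriptors".
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(31, 5))
y_syn = X_syn @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.05 * rng.normal(size=31)
press_good, r2cv_good = loo_press(X_syn, y_syn, ridge_k=1e-3)
# An extreme ridge constant shrinks every coefficient to nearly zero, so each
# hold-out prediction is just the mean of the other 30 cases.
press_flat, r2cv_flat = loo_press(X_syn, y_syn, ridge_k=1e12)
```

The second call illustrates the point made in the text: a model that effectively ignores its predictors yields a slightly negative cross-validated value, because each prediction uses a mean that excludes the case being predicted.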


3.  RESULTS  AND  DISCUSSION 

Table III provides results of studies done on the complete set of 31 diverse compounds as well as the
subset of 18 haloalkanes for the prediction of human logPblood:air. Examining the models developed using
structural descriptors, we find that the RR methodology is generally superior to both PCR and PLS. This
is supported by our earlier studies with various congeneric and diverse sets of chemicals.(34-36) The model
developed using TC descriptors as independent variables was superior to those developed with other
structural descriptor classes in the analysis of the 31 diverse compounds, while the TS+TC model was
superior in the analysis of the 18 haloalkanes.
The results of QSPRs reported in this paper show that structure-property correlations are comparable
or superior to property-property correlations involving experimental saline:air and olive oil:air partition
coefficients in the prediction of human blood:air partition coefficient. For the set of 31 diverse chemicals,
a cross-validated $R^2$ of 0.874 and a PRESS of 7.79 are obtained for the TC model, while the property-
property model utilizing logPsaline:air and logPolive oil:air yields a cross-validated $R^2$ of 0.899 with a PRESS of
6.19 (Table III). For the set of 18 haloalkanes, the TS+TC model yields a cross-validated $R^2$ of 0.897
with a PRESS of 3.02, while the property-property model utilizing logPsaline:air and logPolive oil:air yields a
cross-validated $R^2$ of 0.846 with a PRESS of 4.50. However, property-property models in which rat
logPblood:air is used to predict human logPblood:air are superior to those in which either logPsaline:air and
logPolive oil:air or structural parameters are used as predictors, with a cross-validated $R^2$ of 0.963 and PRESS
of 2.25 for the full set of 31 compounds, and a cross-validated $R^2$ of 0.961 and PRESS of 1.16 for the
subset of 18 haloalkanes.
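Because the cross-validated value is defined as one minus PRESS divided by the total sum of squares, every model fitted to the same set of compounds should imply roughly the same total sum of squares. The short check below is an illustrative aid, not part of the original analysis; it uses only figures quoted in the paragraph above, and the variable names are introduced here.

```python
# (cross-validated R^2, PRESS) pairs quoted in the text.
reported_31 = {
    "TC (RR)": (0.874, 7.79),
    "Rat logPblood:air (LR)": (0.963, 2.25),
}
reported_18 = {
    "TS+TC (RR)": (0.897, 3.02),
    "logPsaline:air + logPolive oil:air (LR)": (0.846, 4.50),
    "Rat logPblood:air (LR)": (0.961, 1.16),
}

def implied_ss_total(models):
    """Invert R2cv = 1 - PRESS / SS_Total for each reported model."""
    return {name: press / (1.0 - r2) for name, (r2, press) in models.items()}

ss_31 = implied_ss_total(reported_31)
ss_18 = implied_ss_total(reported_18)
```

The implied totals agree to within what three-significant-figure rounding of the reported statistics allows, which is a useful sanity check when transcribing values from a table such as Table III.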

It  is  clear  from  the  results  presented  in  Table  III  that  experimental  rat  blood:air  partition  coefficient  is 
the  best  predictor  of  human  blood:air  partition  coefficient.  Acquiring  these  data,  however,  is  time 
consuming  and  requires  laboratory  testing  resources  along  with  the  sacrifice  of  animals.  Experimental 
determination of rat blood:air partition coefficient of hundreds or thousands of candidate chemicals would
be  a  daunting  task.  The  theoretical  descriptor-based  models,  on  the  other  hand,  can  provide  reasonable 
estimates  very  quickly  and  at  a  low  cost. 

Ridge regression coefficients and standard errors for the top 10 descriptors based on |t| values for the
human logPblood:air TC model based on the set of 31 diverse chemicals are provided in Table IV. The
indices most important for the prediction of human logPblood:air include: a) molecular weight (fw),
quantifying molecular size, b) triplet indices (AZVy), encoding information about the nature of atoms, c)
electrotopological state indices (SdO, SddsN, SsBr), which are numerical descriptors of the electronic
states of atoms, d) valence and bonding connectivity indices ($^1\chi^b$, $^1\chi^v$), which quantify structural
information regarding molecular size and shape, and e) a hydrogen bonding parameter (HB1). The
important role of molecular factors such as size, electronic interactions, and hydrogen bonding in
determining  partition  coefficients  of  chemicals  is  evident  from  our  earlier  studies*'  ’  1  and  those  of  Kamlet 
et al.(38)

It is important to reiterate that model predictability is best judged, not with a fitted model, but with a
cross-validated model wherein each of the compounds, in turn, is omitted from the data set and its value
then determined by the coefficients of the remaining n-1 compounds. In this way, we have an accurate, if
not conservative, indication of how well the model will predict property values of new compounds which
are similar to those used to create the model. Figure 1 illustrates the relationship between the fitted and
experimental human logPblood:air values using the TC model for the set of 31 diverse compounds. All
statistical values reported in this paper, however, are based on cross-validated results. Accordingly,
Figure 2 illustrates the relationship between the cross-validated predicted and experimental human
logPblood:air values using the TC model for the set of 31 diverse compounds.

In conclusion, the models based on rat logPblood:air are superior to any of the structure-based models. It
is  important  to  note,  however,  that  experimental  data  are  not  currently  available  for  the  majority  of 
compounds;  and  obtaining  this  data  is  costly  in  terms  of  time  and  monetary  resources.  In  contrast,  we  are 
able  to  obtain  reasonably  good  models  using  structural  descriptors  that  can  be  calculated  very  quickly  and 
inexpensively  for  both  existing  and  unsynthesized  chemicals.  Modeling  based  on  structural  descriptors 
also  promotes  an  understanding  of  the  theoretical  basis  of  properties  and  reduces  the  need  for  animal 
research,  an  area  to  which  a  growing  aversion  exists  in  our  society. 
Table II. Symbols, definitions and classification of calculated molecular descriptors

Topostructural (TS)

$I_D^W$         Information index for the magnitudes of distances between all possible pairs of vertices of a graph
$\bar{I}_D^W$   Mean information index for the magnitude of distance
W               Wiener index = half-sum of the off-diagonal elements of the distance matrix of a graph
$I^D$           Degree complexity
$H^V$           Graph vertex complexity
$H^D$           Graph distance complexity
$\overline{IC}$ Information content of the distance matrix partitioned by frequency of occurrences of distance h
$M_1$           A Zagreb group parameter = sum of square of degree over all vertices
$M_2$           A Zagreb group parameter = sum of cross-product of degrees over all neighboring (connected) vertices
$^h\chi$        Path connectivity index of order h = 0-10
$^h\chi_C$      Cluster connectivity index of order h = 3-6
$^h\chi_{PC}$   Path-cluster connectivity index of order h = 4-6
$^h\chi_{Ch}$   Chain connectivity index of order h = 3-10
$P_h$           Number of paths of length h = 0-10
J               Balaban's J index based on topological distance
nrings          Number of rings in a graph
ncirc           Number of circuits in a graph
DN2Sy           Triplet index from distance matrix, square of graph order (# of non-H atoms), and distance sum; operation y = 1-5
DN21y           Triplet index from distance matrix, square of graph order, and number 1; operation y = 1-5
AS1y            Triplet index from adjacency matrix, distance sum, and number 1; operation y = 1-5
DS1y            Triplet index from distance matrix, distance sum, and number 1; operation y = 1-5
ASNy            Triplet index from adjacency matrix, distance sum, and graph order; operation y = 1-5
DSNy            Triplet index from distance matrix, distance sum, and graph order; operation y = 1-5
DN2Ny           Triplet index from distance matrix, square of graph order, and graph order; operation y = 1-5
ANSy            Triplet index from adjacency matrix, graph order, and distance sum; operation y = 1-5
AN1y            Triplet index from adjacency matrix, graph order, and number 1; operation y = 1-5
ANNy            Triplet index from adjacency matrix, graph order, and graph order again; operation y = 1-5
ASVy            Triplet index from adjacency matrix, distance sum, and vertex degree; operation y = 1-5
DSVy            Triplet index from distance matrix, distance sum, and vertex degree; operation y = 1-5
ANVy            Triplet index from adjacency matrix, graph order, and vertex degree; operation y = 1-5

Topochemical (TC)

O               Order of neighborhood when $IC_r$ reaches its maximum value for the hydrogen-filled graph
$O_{orb}$       Order of neighborhood when $IC_r$ reaches its maximum value for the hydrogen-suppressed graph
$I_{orb}$       Information content or complexity of the hydrogen-suppressed graph at its maximum neighborhood of vertices
$IC_r$          Mean information content or complexity of a graph based on the r-th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
$SIC_r$         Structural information content for r-th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
$CIC_r$         Complementary information content for r-th (r = 0-6) order neighborhood of vertices in a hydrogen-filled graph
$^h\chi^b$      Bond path connectivity index of order h = 0-6
$^h\chi^b_C$    Bond cluster connectivity index of order h = 3-6
$^h\chi^b_{Ch}$ Bond chain connectivity index of order h = 3-6
$^h\chi^b_{PC}$ Bond path-cluster connectivity index of order h = 4-6
$^h\chi^v$      Valence path connectivity index of order h = 0-10
$^h\chi^v_C$    Valence cluster connectivity index of order h = 3-6
$^h\chi^v_{Ch}$ Valence chain connectivity index of order h = 3-10
$^h\chi^v_{PC}$ Valence path-cluster connectivity index of order h = 4-6
$J^B$           Balaban's J index based on bond types
$J^X$           Balaban's J index based on relative electronegativities
$J^Y$           Balaban's J index based on relative covalent radii
$HB_1$          Hydrogen bonding parameter
AZVy            Triplet index from adjacency matrix, atomic number, and vertex degree; operation y = 1-5
AZSy            Triplet index from adjacency matrix, atomic number, and distance sum; operation y = 1-5
ASZy            Triplet index from adjacency matrix, distance sum, and atomic number; operation y = 1-5
AZNy            Triplet index from adjacency matrix, atomic number, and graph order; operation y = 1-5
ANZy            Triplet index from adjacency matrix, graph order, and atomic number; operation y = 1-5
DSZy            Triplet index from distance matrix, distance sum, and atomic number; operation y = 1-5
DN2Zy           Triplet index from distance matrix, square of graph order, and atomic number; operation y = 1-5
nvx             Number of non-hydrogen atoms in a molecule
nelem           Number of elements in a molecule
fw              Molecular weight
si              Shannon information index
totop           Total Topological Index t
sumI            Sum of the intrinsic state values I
sumdelI         Sum of delta-I values
tets2           Total topological state index based on electrotopological state indices
phia            Flexibility index (kp1*kp2/nvx)
IdCbar          Bonchev-Trinajstić information index
IdC             Bonchev-Trinajstić information index
Wp              Wienerp
Pf              Plattf
Wt              Total Wiener number
knotp           Difference of chi-cluster-3 and path/cluster-4
knotpv          Valence difference of chi-cluster-3 and path/cluster-4
nclass          Number of classes of topologically (symmetry) equivalent graph vertices
numHBd          Number of hydrogen bond donors
numwHBd         Number of weak hydrogen bond donors
numHBa          Number of hydrogen bond acceptors
SHCsats         E-State of C sp3 bonded to other saturated C atoms
SHCsatu         E-State of C sp3 bonded to unsaturated C atoms
SHvin           E-State of C atoms in the vinyl group, =CH-
SHtvin          E-State of C atoms in the terminal vinyl group, =CH2
SHavin          E-State of C atoms in the vinyl group, =CH-, bonded to an aromatic C
SHarom          E-State of C sp2 which are part of an aromatic system
SHHBd           Hydrogen bond donor index, sum of Hydrogen E-State values for -OH, =NH, -NH2, -NH-, -SH, and #CH
SHwHBd          Weak hydrogen bond donor index, sum of C-H Hydrogen E-State values for hydrogen atoms on a C to which a F and/or Cl are also bonded
SHHBa           Hydrogen bond acceptor index, sum of the E-State values for -OH, =NH, -NH2, -NH-, >N-, -O-, -S-, along with -F and -Cl
Qv              General Polarity descriptor
NHBinty         Count of potential internal hydrogen bonders (y = 2-10)
SHBinty         E-State descriptors of potential internal hydrogen bond strength (y = 2-10)

Electrotopological State index values for atom types:
SHsOH, SHdNH, SHsSH, SHsNH2, SHssNH, SHtCH, SHother, SHCHnX, Hmax, Gmax,
Hmin, Gmin, Hmaxpos, Hminneg, SsLi, SssBe, SssssBem, SssBH, SsssB, SssssBm, SsCH3,
SdCH2, SssCH2, StCH, SdsCH, SaaCH, SsssCH, SddC, StsC, SdssC, SaasC, SaaaC, SssssC,
SsNH3p, SsNH2, SssNH2p, SdNH, SssNH, SaaNH, StN, SsssNHp, SdsN, SaaN, SsssN,
SddsN, SaasN, SssssNp, SsOH, SdO, SssO, SaaO, SsF, SsSiH3, SssSiH2, SsssSiH, SssssSi,
SsPH2, SssPH, SsssP, SdsssP, SsssssP, SsSH, SdS, SssS, SaaS, SdssS, SddssS, SssssssS, SsCl,
SsGeH3, SssGeH2, SsssGeH, SssssGe, SsAsH2, SssAsH, SsssAs, SdsssAs, SsssssAs, SsSeH,
SdSe, SssSe, SaaSe, SdssSe, SddssSe, SsBr, SsSnH3, SssSnH2, SsssSnH, SssssSn, SsI,
SsPbH3, SssPbH2, SsssPbH, SssssPb

Geometrical/Shape (3D)

kp0             Kappa zero
kp1-kp3         Kappa simple indices
ka1-ka3         Kappa alpha indices
Table III. Summary statistics of predictive models for human logPblood:air based on experimental properties and
theoretical structural descriptors.

A. 31 DIVERSE CHEMICALS

Independent variables                       RR                PCR               PLS               LR
                                      R2 c.v.    PRESS  R2 c.v.    PRESS  R2 c.v.    PRESS  R2 c.v.    PRESS
Structural descriptors
  TS                                    0.257     45.8   -0.451     89.4    0.052     58.4
  TS+TC                                 0.846     9.48    0.165     51.4    0.677     19.9
  TS+TC+3D                              0.827     10.6    0.140     53.0    0.620     23.4
  TS+TC+3D+logP(a)                      0.835     10.2    0.112     54.7    0.652     21.4
  TS                                    0.257     45.8   -0.451     89.4    0.052     58.4
  TC                                    0.874     7.79    0.403     36.8    0.709     17.9
  3D                                    0.147     52.6   -0.013     62.4   -0.256     77.4
Properties
  LogPolive oil:air, LogPsaline:air                                                         0.899     6.19
  Rat logPblood:air                                                                         0.963     2.25

B. 18 HALOALKANES

Independent variables                       RR                PCR               PLS               LR
                                      R2 c.v.    PRESS  R2 c.v.    PRESS  R2 c.v.    PRESS  R2 c.v.    PRESS
Structural descriptors
  TS                                    0.252     22.0    -1.53     74.3   -0.815     53.2
  TS+TC                                 0.897     3.02    0.825     5.14    0.678     9.45
  TS+TC+3D                              0.892     3.16    0.856     4.22    0.702     8.74
  TS+TC+3D+logP(a)                      0.892     3.18    0.856     4.23    0.704     8.69
  TS                                    0.252     22.0    -1.53     74.3   -0.815     53.2
  TC                                    0.891     3.21    0.853     4.32    0.616     11.3
  3D                                    0.753     7.24    0.593     11.9    0.562     12.9
Properties
  LogPolive oil:air, LogPsaline:air                                                         0.846     4.50
  Rat logPblood:air                                                                         0.961     1.16

(a) Calculated logPn-octanol:water; values included in Table I.
Table IV. Ridge regression coefficient and standard error for each of the top 10 descriptors, ranked by |t|, in
topochemical model for the prediction of human logPblood:air, n = 31.

Descriptor      RR coeff     s.e.          t
SdO                0.227    0.021     10.690
$HB_1$             0.340    0.032     10.660
SddsN             -1.694    0.159    -10.640
AZV3               0.130    0.016      8.000
$^1\chi^v$         0.345    0.052      6.670
AZV4               0.224    0.034      6.580
AZV1               0.133    0.024      5.640
SsBr               0.238    0.044      5.390
fw                 0.287    0.054      5.310
$^1\chi^b$         0.139    0.028      5.060
FIGURE CAPTIONS


Figure 1. Experimental vs fitted human logPblood:air using the topochemical (TC) ridge regression
(RR) model for the set of 31 diverse compounds


Figure 2. Experimental vs cross-validated predicted human logPblood:air using the topochemical
(TC) ridge regression (RR) model for the set of 31 diverse compounds
Figure 1.